NSF PAR Search | NSF Public Access Repository

Using BlueField-3 SmartNICs to Offload Vector Operations in Krylov Subspace Methods

https://doi.org/10.1109/HiPC62374.2024.00029

Suresh, Kaushik Kandadi; Michalowicz, Benjamin; Contini, Nick; Ramesh, Bharath; Abduljabbar, Mustafa; Shafi, Aamir; Subramoni, Hari; Panda, Dhabaleswar (December 2024, IEEE)

Full Text Available
Demystifying the Communication Characteristics for Distributed Transformer Models

https://doi.org/10.1109/HOTI63208.2024.00020

Anthony, Quentin; Michalowicz, Benjamin; Hatef, Jacob; Xu, Lang; Abduljabbar, Mustafa; Shafi, Aamir; Subramoni, Hari; Panda, Dhabaleswar K (August 2024, IEEE)

Full Text Available
Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs

https://doi.org/10.1109/HOTI59126.2023.00020

Michalowicz, Benjamin; Suresh, Kaushik Kandadi; Subramoni, Hari; Panda, Dhabaleswar K; Poole, Steve (August 2023, IEEE)

Full Text Available
DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs

https://doi.org/10.1145/3569951.3593595

Michalowicz, Benjamin; Kandadi Suresh, Kaushik; Subramoni, Hari; Panda, Dhabaleswar; Poole, Steve (July 2023, Practice and Experience in Advanced Research Computing 23)

Full Text Available
Accelerating communication with multi‐HCA aware collectives in MPI

https://doi.org/10.1002/cpe.7879

Tran, Tu; Ramesh, Bharath; Michalowicz, Benjamin; Abduljabbar, Mustafa; Subramoni, Hari; Shafi, Aamir; Panda, Dhabaleswar K. (July 2023, Concurrency and Computation: Practice and Experience)

Summary: To accelerate communication between nodes, supercomputers are now equipped with multiple network adapters per node, also referred to as HCAs (Host Channel Adapters), resulting in a “multi‐rail”/“multi‐HCA” network. For example, the ThetaGPU system at Argonne National Laboratory (ANL) has eight adapters per node; with this many networking resources available, utilizing all of them becomes non‐trivial. The Message Passing Interface (MPI) is a dominant programming model for high‐performance computing clusters. Not all MPI collectives utilize all available resources, and this becomes more apparent as bandwidth and adapter counts grow in a given cluster. In this work, we provide a thorough performance analysis of existing multi‐rail solutions and their implications for collectives, and we present the need for further enhancement. Specifically, we propose novel designs for hierarchical, multi‐HCA‐aware Allgather. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter‐node and intra‐node communication. At the micro‐benchmark level, we see large inter‐node improvements of up to 62% and 61% over HPC‐X and MVAPICH2‐X for 1024 processes. Because Allgather is used in Ring‐Allreduce, our designs also improve its performance by 56% and 44% compared to HPC‐X and MVAPICH2‐X, respectively. At the application level, our enhanced Allgather shows an improvement in a matrix‐vector multiplication kernel when compared to HPC‐X and MVAPICH2‐X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2‐X.
Full Text Available
In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences

https://doi.org/10.1109/IPDPSW59300.2023.00065

Michalowicz, Benjamin; Suresh, Kaushik Kandadi; Ramesh, Bharath; Shafi, Aamir; Subramoni, Hari; Abduljabbar, Mustafa; Panda, Dhabaleswar (May 2023, 25th Workshop on Advances in Parallel and Distributed Computational Models)

Full Text Available
A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs

https://doi.org/10.1109/IPDPS54959.2023.00022

Suresh, Kaushik Kandadi; Michalowicz, Benjamin; Ramesh, Bharath; Contini, Nick; Yao, Jinghan; Xu, Shulei; Shafi, Aamir; Subramoni, Hari; Panda, Dhabaleswar (May 2023, 37th IEEE International Parallel & Distributed Processing Symposium)

Full Text Available
Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters

https://doi.org/10.1109/HiPC56025.2022.00025

Suresh, Kaushik Kandadi; Guptha, Akshay Paniraja; Michalowicz, Benjamin; Ramesh, Bharath; Abduljabbar, Mustafa; Shafi, Aamir; Subramoni, Hari; Panda, Dhabaleswar (December 2022, 29th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC))

Full Text Available
Experiences with Porting the FLASH Code to Ookami, an HPE Apollo 80 A64FX Platform

https://doi.org/10.1145/3503470.3503478

Feldman, Catherine; Michalowicz, Benjamin; Siegmann, Eva; Curtis, Tony; Calder, Alan; Harrison, Robert (January 2022, HPCAsia 2022 Workshop: International Conference on High Performance Computing in Asia-Pacific Region Workshops)

Full Text Available
Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms

https://doi.org/10.1145/3437359.3465592

Michalowicz, Benjamin; Raut, Eric; Kang, Yan; Curtis, Tony; Chapman, Barbara; Oryspayev, Dossay (July 2021, PEARC '21: Practice and Experience in Advanced Research Computing)
The development of the A64FX processor by Fujitsu has been a massive innovation in vectorized processors and has led to Fugaku, currently the world’s fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications built with different compilers, and to examine how these applications scale on the different A64FX processors on clusters at Stony Brook University and RIKEN.
Full Text Available
